Zipf and Heaps Laws' Coefficients Depend on Language

Authors

  • Alexander F. Gelbukh
  • Grigori Sidorov
Abstract

We observed that the coefficients of two important empirical statistical laws of language – the Zipf law and the Heaps law – differ across languages, as we illustrate with English and Russian examples. This may have both theoretical and practical implications. On the one hand, the reasons for this may shed light on the nature of language. On the other hand, these two laws are important in, say, full-text database design, where they allow predicting the index size.

Introduction. Perhaps the most famous statistical distribution in linguistics is the Zipf law [1, 2]: in any large enough text, the frequency ranks (starting from the highest) of wordforms or lemmas are inversely proportional to the corresponding frequencies:

log f_r ≈ C − z log r (1)

where f_r is the frequency of the unit (wordform or lemma) having the rank r, z is the exponent coefficient (near to 1), and C is a constant. In a logarithmic scale, this is a straight line at an angle of about −45°. Another, less famous but probably no less important, empirical statistical law of language is the Heaps law: the number of different wordforms or lemmas in a text is roughly proportional to an exponent of its size:

log n_i ≈ D + h log i (2)

where n_i is the number of different units (wordforms or lemmas) occurring before the running word number i, h is the exponent coefficient (between 0 and 1), and D is a constant. In a logarithmic scale, this is a straight line at an angle of about 45°. The nature of these laws is not clear. They seem to be specific to natural languages, in contrast to other types of signals [3]. In practice, knowing the coefficients of these laws is important in, for example, full-text database design, since it allows predicting some properties of the index as a function of the size of the database. In this paper, we present data showing that the coefficients of both laws – z and h – depend on the language. For our experiments, we used English and Russian texts.
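To make equations (1) and (2) concrete, the following Python sketch (not the authors' original code; the function names and toy text are ours) extracts the two point sets from a text – rank/frequency pairs for the Zipf law and running-words/vocabulary pairs for the Heaps law – and estimates the slopes z and h with an ordinary log-log least-squares fit:

```python
import math
import re
from collections import Counter

def zipf_heaps_points(text):
    """Return (rank, frequency) pairs for the Zipf law and
    (running words, distinct wordforms) pairs for the Heaps law."""
    words = re.findall(r"[a-z]+", text.lower())
    freqs = sorted(Counter(words).values(), reverse=True)
    zipf = list(enumerate(freqs, start=1))
    seen, heaps = set(), []
    for i, w in enumerate(words, start=1):
        seen.add(w)
        heaps.append((i, len(seen)))
    return zipf, heaps

def loglog_slope(points):
    """Ordinary least-squares slope of log y against log x."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Toy corpus; a real experiment would use texts of >10,000 running words.
text = "the cat sat on the mat and the dog sat on the log " * 50
zipf, heaps = zipf_heaps_points(text)
print(-loglog_slope(zipf))   # estimate of z in equation (1)
print(loglog_slope(heaps))   # estimate of h in equation (2)
```

On such a tiny repetitive corpus the estimates are of course not meaningful; the sketch only shows which quantities the two laws relate.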
Experiments with Spanish (which we do not discuss here) gave results between those for English and Russian.

* The work was done under partial support of CONACyT, REDII, and SNI, Mexico. We thank Prof. R. Baeza-Yates, Prof. E. Atwell, and Prof. I. Bolshakov for useful discussions.
1 We ignore Mandelbrot's improvements to the Zipf law [1] since they do not affect our discussion.

Experimental data. We processed 39 literary texts for each language (see Appendix 2), chosen randomly from different genres, with the requirement that the size be greater than 10,000 running words (100 KB); in total, 2.5 million running words (24.8 MB) for English and 2.0 million (20.2 MB) for Russian. We experimented with wordforms and lemmas, with very similar results. We plotted on the screen the graphs for pairs of texts (one English and one Russian), using for the Zipf law the points x_r = log r, y_r = log f_r (x_i = log i, y_i = log n_i for the Heaps law). The difference in the angle was in most cases clearly visible. We used linear regression to approximate such a graph by a straight line y = ax + b, where a and b correspond to −z and C for the Zipf law, or to h and D for the Heaps law. Since the density of the points (x_i, y_i) increases exponentially with x_i, we scaled the distance penalty for regression by c^(−x_i) (we have to omit the details here; obviously, the results do not depend on c), which gave the following formulae for a and b:
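The closed-form formulae for a and b are cut off in this copy of the paper. As a hedged reconstruction of the fitting step they describe – standard weighted least squares with weights w_i = c^(−x_i), down-weighting the exponentially denser right-hand side of the log-log plot (the function name is ours) – one could write:

```python
def weighted_fit(points, c=2.0):
    """Weighted least-squares line y = a*x + b with weights
    w_i = c**(-x_i), as a sketch of the regression described above."""
    w = [c ** (-x) for x, _ in points]
    sw = sum(w)
    swx = sum(wi * x for wi, (x, _) in zip(w, points))
    swy = sum(wi * y for wi, (_, y) in zip(w, points))
    swxx = sum(wi * x * x for wi, (x, _) in zip(w, points))
    swxy = sum(wi * x * y for wi, (x, y) in zip(w, points))
    a = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    b = (swy - a * swx) / sw
    return a, b

# On exactly linear data, any positive weights recover the same line:
pts = [(0.5 * k, -1.1 * (0.5 * k) + 7.0) for k in range(1, 20)]
a, b = weighted_fit(pts)
print(a, b)  # slope close to -1.1, intercept close to 7.0
```

On exactly linear data the choice of c is irrelevant, which is consistent with the paper's remark that the results do not depend on c; for noisy data the weights change which region of the plot dominates the fit.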


Similar articles

Languages cool as they expand: Allometric scaling and the decreasing need for new words

We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented si...


Deviation of Zipf's and Heaps' Laws in Human Languages with Limited Dictionary Sizes

Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in Indo-European language family, but it does not hold for languages like Chinese, Japanese and Korean. These languages consist of characters, and are of very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, a...


Do neural nets learn statistical laws behind natural language?

The performance of deep learning in natural language processing has been spectacular, but the reasons for this success remain unclear because of the inherent complexity of deep learning. This paper provides empirical evidence of its effectiveness and of a limitation of neural networks for language engineering. Precisely, we demonstrate that a neural language model based on long short-term memor...


Strong, Weak and False Inverse Power Laws

Pareto, Zipf and numerous subsequent investigators of inverse power distributions have often represented their findings as though their data conformed to a power law form for all ranges of the variable of interest. I refer to this ideal case as a strong inverse power law (SIPL). However, many of the examples used by Pareto and Zipf, as well as others who have followed them, have been truncated ...


Discovery of Power-Laws in Chemical Space

Power-law distributions have been observed in a wide variety of areas. To our knowledge however, there has been no systematic observation of power-law distributions in chemoinformatics. Here, we present several examples of power-law distributions arising from the features of small, organic molecules. The distributions of rigid segments and ring systems, the distributions of molecular paths and ...




Publication date: 2001